Estimating Nuisance Parameters in Inverse 
Problems 



Aleksandr Y. Aravkin and Tristan van Leeuwen 

Department of Earth, Ocean and Atmospheric Sciences, The University of British 
Columbia, 6339 Stores Road, Vancouver, British Columbia Canada, V6T 1Z4. 

E-mail: {saravkin, tleeuwen}@eos.ubc.ca 

Abstract. Many inverse problems include nuisance parameters which, while not 
of direct interest, are required to recover primary parameters. Structure present in 
these problems allows efficient optimization strategies — a well known example is 
variable projection, where nonlinear least squares problems which are linear in some 
parameters can be very efficiently optimized. In this paper, we extend the idea of 
projecting out a subset over the variables to a broad class of maximum likelihood 
(ML) and maximum a posteriori likelihood (MAP) problems with nuisance parameters, 
such as variance or degrees of freedom. As a result, we are able to incorporate 
nuisance parameter estimation into large-scale constrained and unconstrained inverse 
problem formulations. We apply the approach to a variety of problems, including 
estimation of unknown variance parameters in the Gaussian model, degree of 
freedom (d.o.f.) parameter estimation in the context of robust inverse problems, 
automatic calibration, and optimal experimental design. Using numerical examples, 
we demonstrate improvement in recovery of primary parameters for several large- 
scale inverse problems. The proposed approach is compatible with a wide variety of 
algorithms and formulations, and its implementation requires only minor modifications 
to existing algorithms. 
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1. Introduction 

Many inverse problems can be formulated as optimization problems of the form 

V mmg(x,9), (1) 

x£Pc,0 

where g : M. n x WL k — > K is a twice differentiable function, X C M. n , x is a primary set of 
parameters of interest, while 9 G M. h is a secondary set of nuisance parameters, such as 
variance parameters, application-specific tuning parameters, regularization parameters, 
or degrees of freedom parameters. In many settings, k ^ n. 

A rich source of examples in (jlj is the class of separable least-squares problems, 
extensively studied over the last 40 years [Til [TOl [15] . A problem in this class is given 
by 

wm\\y - $(x)6\\l , (2) 

x,0 

where the matrix is parametrized by x. Note that g has a very special form in 

this case, and X = W 1 . For problems in this class, the major insight is to exploit the 
structure of the problem to obtain a reduced problem 

min||j,-$(z)0(a;)||*, (3) 

X 

where 

9(x) = argmin \\y — &(x)9\\% . (4) 
e 

At first glance, this does not make the problem easier to solve. However, it turns out 
that ([3j can be solved using black-box approaches as long as we re-evaluate 9{x) for any 
given x, but treat 9 as fixed whenever x is updated. The problem Q has a closed form 
solution, and as noted in [10], this approach converges much faster than optimization 
approaches to minimize the full functional ^ using descent methods for (2, 9). 

In this paper, we consider problems of type ([I]), where we can easily compute 
9{x) = argmin e g(x, 9). We show that many algorithms for solving instances of ([!]) with 
9 fixed can be easily modified to solve the joint inverse problem in x and 9. We provide 
explicit details for several important classes of problems in ([T]), including variance and 
degrees of freedom (d.o.f.) estimation, and automatic calibration of nonlinear least 
squares and robust inverse problem formulations. 

The paper proceeds as follows. In Section [2j we review the necessary theory 
underlying our approach to the entire class Q. In Section (|3|, we discuss the role 
of nuisance parameters, such as variance and degrees of freedom, in MAP estimation 
formulations. We present two important applications in detail: 

(i) Variance estimation in multiple data sets (see [5]). 

fii) Estimation of variance and d.o.f. for Student's t formulations (see [13]). 



Both are illustrated on a seismic imaging problem where the data are contaminated 
with various types of noise. 
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In Section |4j we discuss the automatic calibration problem, where the forward 
model includes a calibration parameter that needs to be estimated. We illustrate the 
approach on a seismic imaging problem where the calibration parameter are frequency- 
dependent source-weights. We discuss the application of the proposed approach to 
Optimal Experimental Design in section [5] 

Finally, we discuss other possible applications and present conclusions. 



2. General Formulation 

We consider problems of the form Q, and assume that for any given x G X, one can 
easily find 

6{x) G argmin g(x, 9) . (5) 
e 

This condition can be relaxed, and 9{x) can be considered a local minimum. Rather 
than working to solve ([IJ, we can instead focus on the reduced objective 

g{x) = g(x,9(x)) . (6) 

This approach is justified by the following theorem, adapted from jU Theorem 2]. 

Theorem 2.1 Suppose that U C W 1 and V C l fc are open, and g(x,8) is twice 
continuously differentiable onlA x V. Define the optimal value function 

g{x) = ming(x,9) . (7) 

6 

Suppose that x G U and 9 G V are such that Veg(x,9) = and Vlg{x,9) is 
positive definite. Then there exist neighbourhoods of x and 9 and a twice continuously 
differentiable function 9 :U — )■ V where 9(x) is the unique minimizer of g(x, •) on V. 
Then g(x) is twice continuously differentiable, with 

V x g(x) = V x g(x,9(x)) (8) 
Vl~g{x) = V^(x, 9{x)) + Ko9& 0(x))Vj(x) . (9) 

Remark 2.2 Theorem [$] provides sufficient conditions for existence of the first and 
second derivatives of g. In practice, these derivatives may exist even if the smoothness 
hypotheses are not satisfied. Consider g (9, x) = ^- + 9 2 — \9\x 2 . In this case, \9(x)\ = 
so g(x) = \ is smooth even though g(x, 9) is not. 



Theorem [8] suggests a natural approach to designing algorithms for minimizing g(x). 
In the unconstrained case (i.e. X is the whole space), consider iterative methods of the 
form 

x ^ = x k - lk H~ l V x g{x k ) =x k - ^H^VM^, 0(x k )) . (10) 

Specifically, H k = I yields Cauchy's steepest descent, H k = V 2 x g(x k ) yields a 
modified Newton method, and approximations to H k that use only first order derivative 
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information yield Gauss-Newton or Levenberg-Marquardt type methods. A quasi- 
Newton method such as BFGS or L-BFGS may be similarly implemented using only 
information from 

If X is a closed and bounded set that allows a simple projection, such as a set of 
box constraints ({x : I < x < u}, an ellipsoidal set {x : \\x\\m < r }, or the 1-norm ball 
({x : \\x\\i < r}), this can be exploited to solve 0. For example, we can use a modified 
projected gradient method 

x k+1 = V x [x k -^ x g{x k )\ (11) 

or an appropriately modified projected quasi-Newton method, such as the one described 
in [IB] • The point is that the structure of X does not enter into the computation of ([8]) 
or ([9]), so a natural strategy is to compute these quantities first and then apply methods 
that exploit the structure of X. Moreover, we show in the next corollary that the 
point (x, 9{x)) satisfies the first order necessary conditions for the original (constrained) 
problem. 

Corollary 2.3 Suppose the hypotheses of Theorem^hold, and the additional constraint 
x £ X is imposed, where X is a closed convex set. If x satisfies the first order necessary 
conditions for g{x), then (x, 9) with 9 = 9{x) satisfies the first order necessary conditions 
for g(x,9). 



Proof: The first order necessary conditions for Q are 

V*<7(M) = 

-V x g{x,6) e N x {x) 1 ' 

where N x (x) is the normal cone to X at the point x (see [T7j for details). The first 
order necessary condition for x to be a minimizer of the reduced objective ^ is 

-V x g{x)eN x {x) (13) 



Since we have W x g{x) = V x g(x, 6) by Theorem |8| (x, 0(x)) satisfies (12) if and only if x 
satisfies (13). On the other hand, 6{x) satisfies the first equation of (12) by construction. 

Thus, for many applications (both constrained and unconstrained), we can 
systematically extend many standard algorithms for minimizing g(x, 9) with 9 fixed 
to extended problems ([T|. This approach avoids computing the full Hessian of the 
modified objective Q, since it involves V x 9(x). 

In the next sections, we present some of these applications and provide full 
algorithmic details and numerical work. 



3. Complicating Parameters in Maximum Likelihood Estimation 

Many inverse problems can be formulated as maximum likelihood (ML) problems within 
a statistical modeling framework. Given data y, we want to solve for parameters of 
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interest x, using the fact that the parameters are related to the data via a (possibly 
nonlinear) forward model: 

d = F(x) + e . (14) 



The e term in (14) reflects a statistical model of the discrepancy between the model 
F{x) and the true data d. Independent, identically distributed (i.i.d.) Gaussian errors 
e ~ N(0, a 2 I) are a common choice, and even though the variance parameter a 2 is 
unknown, it does not affect the maximum likelihood formulation in x. This is not true 
if the data come from different sources, with each group having its own parameter of. 

More generally, may come from a range of parametric distributions. The 
Student's t distribution has been applied in many instances where large measurement 
errors are common or unexplained artifacts in the data are an issue [Tj [21 US]- These 
applications require estimates for degrees of freedom and variance parameters even with 
the i.i.d. assumption on the errors. 

If we take 9 to be unknown nuisance parameters, the general maximum likelihood 
formulation for estimating x in model (14) takes the form (fl]). The method proposed in 
this paper is well suited for online estimation of 6, and in the remainder of the section 
we provide full exposition for the multiple sources of error example and for Student's t 
hyperparameter estimation. 

3.1. Variances in Multiple Data Sets 

Estimating variance parameters in multiple datasets is an important problem in many 
areas, including drug and tracer kinetics [5], and geophysics. In this section, we review 
the formulation presented in [5], and show that the algorithm derived in [5] follows 
immediately from the general approach we propose here, i.e. it is a Gauss-Newton 



method of form (10). We present a numerical example, illustrating the importance of 
variance parameter estimation for a large-scale geophysical inverse problem. We also 
extend the approach to the (fully observed) multivariate Gaussian case with correlations 
between measurement errors. 

We are given M experiments indexed by i, each of which yields N{ measurements 
and has its own variance parameter Oi. All experiments share a common set of primary 
parameters x: 

di = Fi(x) + 6i (15) 

where di G R *, Fi(x) is the modeling operator for the i th experiment and e« ~ N(0, of I). 
If the variance parameters are fixed, the ML estimation problem for x is given by 

M l 

min y^^Wdi-F^Wl. (16) 

i=i 1 

The joint ML estimation problem for x and a 2 = {<J 2 } is given by 

- / 1 \ 

min g(x,a 2 ):=Y,[N l \og(2^ 2 ) + - 2 \\d i -F l (x)\\l) . (17) 
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This is a special example of ([TJ. 

With x fixed, (17) separates, and ^ has a closed form solution, which we find by 
taking the gradient with respect to each of and setting it to 0: 

d$(x) = ±r\\d i -F i {z)\\l (18) 

This quantity is precisely the population variance estimate. The modified problem ^ 
is now given by 

M 

mm g(x) := ^ (JV< ]og(2ira? (x)) + N { ) . (19) 
i=i 

The gradient of this objective is given by 

m 

V x g(x) = - V ——VFi{x){di - Fi(x)) , (20) 
while the Gauss-Newton (GN) Hessian approximation is given by 



H ( x ) = J2 ^VF,(a;)VF,(x) T . 

Note that this is an approximation to V^g, and completely ignores the term 
V^. e g(x,9(x))V x 8(x) in Q. The term can actually be explicitly calculated for this 
application, and turns out to be a dense negative definite correction to the Hessian 
approximation. If we ignore it, we recover the algorithm in [5], which can be seen by 
forming the GN subproblem: 

M 1 

min 5^ —r, — - Fi(x k ) - VFi(x k ) T x\\l . (21) 

x 

This expression matches [HJ (12)] up to a constant. However, while in [3] the 
function (21) came about as a cleverly constructed proxy objective for (17), we can 
now view it as a natural GN approximation to the modified objective (19). 



Example: Full Waveform Inversion Full waveform inversion (FWI) is an approach 
to obtain gridded subsurface velocity parameters from seismic data. Experiments are 
conducted by placing explosive sources on the surface and recording the reflected waves 
with an array of receivers on the surface. FWI is naturally cast as a nonlinear least 
squares optimization problem [T9~l [IB] , and fits in the framework described above. The 
data, di, in this case represents the Fourier transform of the recorded time series for 
frequency i. The corresponding modeling operator, Fi(x) = PAi(x)~ 1 Qi, inverts a 
discretized Helmholtz operator Ai(x) for the i th frequency and the gridded velocity field 
x and samples the wavefield at the receiver locations. Here, P denotes the sampling 
operator and each column of the matrix Qi is a gridded source function. 

To illustrate the approach we use a subset of the well-known Marmousi benchmark 
model, depicted in figure [T] (a). The model is discretized on a 201 x 301 grid with 10 m 
grid spacing. We generate data for 151 sources, 301 receivers (i.e., iVj = 151 x 301) — all 
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equi-spaced and located at the surface — and M — 12 frequencies between 3 and 25 Hz. 
Typically, the data has a lower signal to noise ratio for the low and high frequencies. 
To emulate this situation we add Gaussian noise to the measurements with variance 
o~i ~ (i — 6) 2 . We use an L-BFGS method to solve both the the modified optimization 



problem (19) and the original problem for a fixed o~i = 1 for all i. The results after 50 
iterations are shown in figure [ljb,c). The corresponding error between the reconstructed 
and true model is shown in figure [l](d). Finally, we show the estimated variance at 
the final model for both reconstructions in figure []Je). The reconstruction obtained 
by solving the modified problem is clearly better. Interestingly enough, the variance 
estimates for both models are almost identical. 

3.2. Correlated Multivariate Observations 

The results from the previous case can be generalized to general variance estimation in 



a multivariate inverse problem setting with correlated errors. Consider the model (15) 



where now we take q ~ N(0, E). In this case, all of the ej are in of the same dimension. 



The ML objective corresponding to (17) is given by 

M 



min g(x,E):= \Mlog(2Trdet(E)) + ^(d i -F i (x)) T E-\d i -F i (x))j .(22) 
The point here is that despite the generalization to full E, we still have a closed 



form solution analogous to (18): 

1 - 

E(x) = argmin^x, E) = — V(d< - F < (x))(d i - F^ f . (23) 

E i=l 

This can be shown by a simple derivative computation: 

sM^E) = -MS+^ r tr(5X 1 E- 1 (d j -i!i(x))(d i -F i (x)) r ) 
= -MV + 5™ 1 (d i -F i (x))(d i -F i (x)) T = . 

Therefore, the variable projection method applies immediately to (p2j), at the cost of 



computing, at each iteration in x, the standard multivariate variance estimate (23). If 
this cost is high (i.e. if each has high dimension), there are still a number of strategies 
that make the proposal feasible. For example, ( [23] ) can be computed approximately 
using a random subset of the residuals. 

In addition to improving the primary parameters, incorporating nuisance parameter 
estimation can be helpful to post-processing analysis such as uncertainty quantification. 
For example, the estimate of E at the final solution can be used to estimate the posteriori 
covariance matrix in the model space 

3. 3. Degrees of Freedom and Variance Estimation for Student 's t Formulation 

Many applications require robust formulations to obtain reasonable results with noisy 
data or in cases where a portion of the data is unexplained by the forward model (e.g. in 
the presence of coherent artifacts). A useful way to derive these formulations is to begin 
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with the statistical model (15) where the noise term is modelled using a particular 
parametric density, and then formulate the maximum a posteriori (MAP) likelihood 
problem. The least-squares formulation corresponds to a Gaussian assumption on q 
(see section 3.1), while assuming a Laplacian distribution leads to a one-norm penalty 



on the data-misfit. 

As shown by [21 Theorem 2.1], in cases where unexplained artifacts may be large 
or constitute a significant portion of the data, it is better to use heavy-tailed densities. 
A prime example is the Student's t, whose density is given by 

, 2 IN r((jfc + i)/2) / y 2 v (fc+1)/2 , N 

p(y,a 2 ,k) = /' J [1 + ^t) . 24 

This density was first successfully used in [T3] in the data fitting context. The 
degrees of freedom parameter k was seen as a tuning parameter, smoothly transitioning 
between heavy-tailed and near-Gaussian behaviour; k and a were fit using Expectation 
Maximization (EM) and scoring methods. This density was also successfully used in 
the Kalman smoothing context |Sj, where it was suggested that the EM algorithm can 
be used to fit meta-parameters. Recent work using the Student's t distribution [21 El EE] 
has side-stepped the problem, using fixed values for a and k. 

In this section, we show that the general projection approach can be used to solve 
the joint inverse problem, treating scale and degrees of freedom as nuisance parameters. 
We propose a novel simple method, different from EM or scoring methods discussed 
in |13| . for estimating scale and degrees of freedom for any given set of residuals. Given 



the model (15), the full MAP Student's t estimation problem is given by 

ndnrt*,**) := -nlog (l + ,(25) 

where = di — F^x). Following the philosophy presented in the paper, we solve the 
problem by defining the modified objective 

g{x) = g(x,a 2 (x),k(x)) (26) 

with 

(a 2 (x), k(x)) = argmm g(x, a 2 , k) . 

The two-dimensional optimization problem in (a 2 , k) required to evaluate g(x) can be 
solved using a customized routine or a black-box optimization code. An application is 
presented below. 



Example: Traveltime tomography: We consider a cross-well traveltime tomography 
problem. In this case, sources and receivers are placed in vertical wells and the 
data consists of picked traveltime of first arrivals. Since the data are typically very 
noisy, a portion of the traveltimes may be picked erroneously, motivating the use of 
robust penalties for the inversion. The traveltimes are computed by a geometric optics 
approach, where wave propagation is modeled via rays. The traveltime between a 
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given source and receiver is simply the integral of the reciprocal velocity along the 
corresponding ray-path. By assuming small perturbations of a known background 
velocity, we arrive at a linear modeling operator with a fixed ray geometry. The data 
are the traveltime perturbations, while the primary parameter of interest is the velocity 
perturbation, both taken with respect to a known background model. 

In this example, we consider a constant background velocity, so that the ray 
paths are straight lines. The modeling operator is therefore essentially a Radon 
transform, which is often used in medical X-ray imaging applications. The true velocity 
perturbation is discretized on a 51 x 51 grid and is shown in figure [2^a). The 
corresponding data for 51 sources and receivers and the added outliers are shown in 
figure |2^b). We regularize the inversion by inverting for the primary parameters on a 
courser grid of size 26 x 26. We then interpolate back to the fine grid using 2D cubic 
interpolation. The modified optimization problem is now given by 

mmp (AT - Ax), (27) 

X 

where AT G M. are the measured traveltime perturbations, x G IR 676 is the 
velocity perturbation, A is the modeling operator which combines the Radon transform 



and interpolation, and 9 = (o- 2 ,k) is obtained by solving (26) using a Nelder-Mead 
method [33]. 



Note that we may treat 9 as fixed when designing an algorithm to solve (27), as 



long as the parameters are re-estimated at every evaluation of p${r) and its derivatives. 



To solve (27) we use a modified Gauss- Newton algorithm which calculates the updates 
by solving 

(A T H e (r k )A) Ax k = A T Vp e (r k ), (28) 

where r k = AT — Ax k and H e is a positive approximation of the Hessian V 2 p<9- We 
solve the subproblems using CG. Note that when p = \\ ■ || 2 , the algorithm converges in 
one GN iteration, which is computed by applying CG to the normal equations. 

We compare the following three approaches: i) least-squares, shown in figure |3](a), 
ii) Student's t with a fixed 9 wich is estimated once at the initial residual, shown 
in figure |3](b), and Hi) Student's t where we estimate 9 at each iteration, shown in 
figure [3](c) . In order to understand the difference between the latter two cases, we show 
histograms of the initial and final residuals as well as the influence function for the 
corresponding 9 in figures |4^a-c). Clearly, re-fitting the 9 at each iteration allows the 
inversion to home in on the good data while ignoring the outliers. 

4. Automatic calibration 

In this section we consider the case where the forward model includes a calibration factor 
a: 

d = F(x,a) + e. (29) 
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In case of the non-linear data-fitting problem described earlier, the modified 
objective is given by 

mmg(x) = p(d — F(x, a)), (30) 

X 

where 

a (x) = argmin p(d — F(x, a)). (31) 

a 

The motivating example that led us to consider this class of problems is presented 
below. 

Example: FWI with source estimation: Seismic data can be interpreted as the Green's 
function of the subsurface, parametrized by x, convolved with an unknown (bandlimited) 
source signature. In the frequency-domain we can model the uknown source signature 
by multiplication with a complex scalar for each frequency. The problem of interest is 
now formulated as 

M \ 

g{x,a) ;=^2p(di- aiFi(x))\ , (32) 

where the index i runs over frequency. Just as in the variance parameter case, the 
parameters «j are linked only through the parameters x, and for a given x the problem 
decouples completely, giving 

«i(x) = argmin p{di — a«Fj(x)) . (33) 

O-i 

We consider the least-squares and Student's t penalty, and use a scalar Netwon-type 



mm 



method to solve (33): 



a „ +1 = a „_ (Vp(r-(x)),F,(x)) 

1 (F(x),HF t (x)) ' [6V 
where r\{x) = di — a^F^x), Vp is the gradient of the penalty function and H is (a 
positive definite approximation of) the Hessian V 2 p- In particular, we have: 

• least-squares: p(r) = \ J2i r h ^ Pi = r i an d Ha = 1. 

• Student's t: p(r) = \ £\ log(fc + rf ), V ' Pi = n/(k + rf ) and H u = l/(k + r?). 

For more details on the Student's t approach we refer to [2]. 

We generate seismic data for the velocity model depicted in figure [5] (a) with a 
time-domain finite difference code. The data consists of 141 sources and 281 receivers 
and has a recording time of 4 seconds. 10 percent of the data is corrupted with large 
outliers. 

We invert the data in several stages, moving from low to high frequencies. Each 
stage uses only a few frequencies and the output is used as initial guess for the subsequent 
stage. This is a well-known strategy in FWI to avoid local minima [6]. We use an L- 
BFGS method to solve the resulting optimization problems, starting from the initial 
model shown in figure [5] (b). The results are shown in figure [5] (c,d). The Student's t 
approach recovers the most important features of the model whereas the least-squares 
approach leads to a very noisy model. 
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5. Optimal experimental design 

In Optimal Experimental Design one is concerned with finding optimal design 
parameters 9 for which a set of test models {x{\ can be recovered from the corresponding 
simulated data di(9) = F(xi,9) + e. This can be formulated as an optimization 
problem (cf. p2] and references cited therein) 

wmJ2Q{xi(0),Xi) + C{9), x l (9) = &rgmm\\F(x,9)-d l (9)\\ 2 . (35) 

i 

Here, Q(xi, Xi) measures the quality of the reconstruction (lower is better) and C{9) 



measures the cost of a given experimental parameter setting. Note that (35) is actually 
the reduced problem for the joint optimization problem 



mm 



\\F(x,9) - di(9)\\l + ^g(x,x 4 ) + C(9), (36) 



where x has been projected out. 

The downside to this approach is that projecting out x is expensive, since every 
iteration of any algorithm to find Xi(9) requires repeated evaluations of F(x, 9), for each 
model in the class {xi}. Rather than projecting out x, we can project out the design 
parameter 9 to arrive at a different reduced objective 

mmg(x) := ^ \\F(x,9(x)) - di{9{x))\\l + ^Q(x,x 4 ) + C(9(x)), (37) 

i i 

where 9(x) = axgmin | \F(x, 9)— di{9)\ \l+C(9). In many cases, 9(x) can be computed 
cheaply without re-evaluating the whole forward model. The forward modeling need 
only be done when x is updated, as in the other applications that we presented. For 
example, 9 may represent a vector of source weights for waveform inversion in which 



case F(x,9) = PA(x)~ 1 Qd\ag(9) (see section 3.1). Since the data are linear in the 
source, we need only invert the Helmholtz system once (for a given x), and therefore 9 
can be computed relatively cheaply. A reasonable penalty on 9 might be the one-norm, 
in which case we are looking for a setup with as few sources as possible. Alternatively, 
we can impose a two-norm penalty on 9 to find a setup where the sources require the 
least amount of energy. 



6. Discussion and Conclusions 



Many inverse problems involve nuisance parameters that are not of primary interest 
but can have significant influence on the estimation of primary parameters. Common 
examples include variance, degree of freedom, and calibration parameters. These issues 
arise in a great variety of applications, including pharmacokinetic modeling [5], seismic 
inverse problems [19], dynamic systems [7j, uncertainty quantification |9] and optimal 
experimental design [T2] . 

In this paper, we proposed a straightforward approach to fitting these nuisance 
parameters on the fly, while solving the overall inverse problem. Specifically, we 
formulated the problem as a joint optimization over primary parameters x and nuisance 
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parameters 6 ([T]), and showed that for a large class of problems, one can simply project 
out the 9 parameters by solving (|5]). In this least squares case, this idea has been 
carefully studied under the name variable projection [151 [TO]. As we showed, these ideas 
extend nicely to the entire class (jl]). In particular, Theorem 2.1 and Corollary 2.3 



characterize the general approach and are the basis for algorithm design of first and 
second order methods. 

An immediate consequence of the work is the ability to modify first and second 
order algorithms that exploit particular application structure to also fit nuisance pa- 
rameters. We demonstrated this in practice using several (large scale) inverse problems: 



Application 


Complicating parameters 


Algorithm 


full waveform inversion 


variances in multiple datasets 


L-BFGS 


travel time tomography 


student's t parameters 


Gauss-Newton with CG 


automatic calibration 


unknown source amplitudes 


L-BFGS 



In the case of variances in multiple datasets, the proposed approach matches the 
algorithm proposed in [5], and therefore the development we presented provides an 
alternative (and significantly simpler) derivation. We have also shown that the approach 
can be easily extended to estimate covariances between error sources in the case where 
we have multivariate observations in Section 13.21 

In the case of student's t parameters, it is interesting to note that when estimating 
degrees of freedom for fixed residuals, our approach matches the one used in the MASS 
library of the R programming language [20]. To our knowledge, this approach has 
not been used for fitting degrees of freedom in general inverse problems, and in fact 
Lange, Little, and Taylor [13J, who first proposed Student's t inversion, advocated a 
very different (EM-type) approach for degrees of freedom fitting. 

From a theoretical point of view, the method we propose can be used to solve a 
variety of inverse problems from the general class Q. From a practical point of view, 
the main selling point of the proposed approach is the ability to modify existing methods 
to solve for nuisance parameters on the fly. 
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Figure 1. Results for variance estimation, (a) True model, (b) result after 50 
iterations for fixed constant variance Oi — 1 and (c) result after 50 iterations with 
variance estimation. The sample- variance for the latter two (red and blue respectively) 
results as well as the true variance (dashed line) is shown in (d). Finally, (e) shows the 
relative model error for each iteration for fixed (red) and estimated (blue) variance. 




Figure 2. (a) velocity perturbation in m/s used to generate the observed data, (b) 
shows the corresponding traveltime perturbations in black and the outliers in red. 
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Figure 3. Results for traveltime tomography, (a) least-squares reconstruction, 
(b) Student's t reconstruction with fixed estimated at the initial residual and (c) 
Student's t reconstruction where 6 is re-estimated at every iteration. 
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Figure 4. Histograms of the residuals and corresponding influence functions p' g . (a) 
initial residual, (b) final residual corresponding to figure [3] (b) and (c) final residual 
corresponding to figure [3] (c). In the latter case the parameters 6 are re-estimated 
at each iteration, allowing the inversion to home in on the good data and ignore the 
outliers. 




Figure 5. (a) True velocity model, (b) Initial velocity model, (c) Least-squares 
reconstruction from noisy data, (d) Student's t reconstruction from noisy data. 



